Abstract: Many algorithms for approximate nearest neighbor search in
high-dimensional spaces partition the data into clusters. At query time, in
order to avoid exhaustive search, an index selects the few (or a single)
clusters nearest to the query point. Clusters are often produced by the
well-known k-means approach since it has several desirable properties. On the
downside, it tends to produce clusters having quite different cardinalities.
Imbalanced clusters negatively impact both the variance and the expectation of
query response times. This paper proposes to modify k-means centroids to
produce clusters with more comparable sizes without sacrificing the desirable
properties. Experiments with a large-scale collection of image descriptors
show that our algorithm significantly reduces the variance of response times,
at a slight cost with respect to the trade-off between efficiency and search
quality.
Reducing the variability of response time
for image search

Summary: Many algorithms for approximate nearest neighbor search in high
dimension partition the data into clusters. At query time, to avoid a costly
exhaustive search, an index selects one or several of the clusters nearest to
the query. Clusters are often obtained with the k-means method. A drawback of
this method is that it tends to produce clusters of differing sizes. This
imbalance between cluster cardinalities has a negative effect on both the
variance and the expectation of the response time. This article proposes to
modify the centroids obtained by k-means in order to produce clusters of
comparable sizes. Experiments on a large collection of image descriptors show
that our algorithm significantly reduces the variance of the response time,
while slightly degrading the trade-off between efficiency and the quality of
the returned results.

Keywords: nearest neighbor search, large databases, Euclidean distance,
quantization
1 Introduction

Finding the nearest neighbors of high-dimensional query points still receives a
lot of research attention as this fundamental process is central to many
content-based applications. Most approaches rely on some form of partitioning
of the data collection into clusters of descriptors. At query time, an indexing
structure selects the few (or a single) clusters nearest to the query point. Each
candidate cluster is scanned, the actual distances to its points are computed,
and the query result is built upon these distances.

There are various options for clustering points, the most popular being the
k-means approach. Its popularity stems from its nice properties: it is a simple
algorithm, surprisingly effective and easy to implement. It deals well with
the true distribution of data in space by minimizing the mean square error over
the clustered data collection. On the downside, it tends to produce clusters
having quite different cardinalities. This, in turn, impacts the performance of
the retrieval algorithm: scanning heavily filled clusters is costly as the distances
to many points must be computed. In contrast, under-filled clusters are cheap to
process, but they are selected less often, as a query descriptor is less likely
to be associated with these less populated clusters. Overall, having imbalanced
clusters impacts both the variance and the expectation of query response times.
This is very detrimental in contexts where performance is paramount, such
as high-throughput settings where the true resource consumption can no longer
be accurately predicted by cost models.

This phenomenon has an even more detrimental impact at large scale. In
this case, clusters must be stored on disk and performance suffers severely
when fetching large clusters because of the large I/Os involved. Furthermore,
k-means is known to fail at clustering at very large scale, so hierarchical or
approximate k-means must be used, which, in turn, tends to increase the
imbalance between clusters [1].

This paper proposes an extension of the traditional k-means algorithm to
produce clusters of much more even size. This is beneficial to performance since
it reduces the variance and the expectation of query response times. Balancing
is obtained by slightly distorting the boundaries of clusters. This, in turn,
impacts the quality of results since clusters no longer correspond to the initial
optimization criterion. Section 2 defines the problem we are addressing
and introduces the key metrics later used in the evaluation. Section 3 details
the balancing strategy we propose. Section 4 evaluates the impact of balancing
on the response time of queries when using large collections of descriptors
computed over 1 million images from Flickr. It also shows that result quality
remains satisfactory with respect to the original k-means. Section 5 concludes
the paper.

2 Problem statement 

2.1 Base Clustering and Searching Methods 

Without loss of generality, we partition a collection of high-dimensional feature
vectors into clusters defining Voronoi cells. We typically use a k-means
algorithm quantizing the data into k cells. Each cell stores a list of the vectors
it clusters. This approach is widely adopted in the context of image search,
where clustering is applied to local [TJJ [T2] or global descriptors [5J [5]. A search
strategy exploiting this partitioning is usually approximate: only one or a few
cells are explored at query time. The quality of results typically increases
when multiple cells are probed during the search, as in [HI IH1 HI E]. The actual
distances between the query point and the features stored in each such cell are
subsequently computed [TJ 111]. Therefore, the response time of a query is
directly related to (i) the strategy used to identify the cells to explore and (ii)
the total number of vectors used in distance computations. The cost for (i) is
fixed and mainly corresponds to finding the $m_p$ centroids that are the closest
to the query point (L2). In contrast, the cost for (ii) heavily depends on the
cardinality of each cell to process. It is of course linked to $m_p$. Note that (i) is
often negligible compared to (ii).
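The two cost components above can be illustrated with a minimal sketch of a multi-probe search over inverted lists. This is not the implementation used in the paper: the function names (`build_lists`, `search`), the toy vectors and the pure-Python distance computation are assumptions made for illustration only.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean (L2) distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_lists(vectors, centroids):
    """Store each vector id in the list of its nearest centroid (Voronoi cell)."""
    lists = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda i: sq_dist(v, centroids[i]))
        lists[nearest].append(vid)
    return lists


def search(query, vectors, centroids, lists, m_p=1, top=1):
    """Probe the m_p nearest cells, then rank their vectors by true distance."""
    # Step (i): fixed cost, one distance computation per centroid.
    probed = heapq.nsmallest(m_p, range(len(centroids)),
                             key=lambda i: sq_dist(query, centroids[i]))
    # Step (ii): cost proportional to the total size of the probed lists.
    candidates = [vid for i in probed for vid in lists[i]]
    ranked = sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))
    return ranked[:top], len(candidates)


# Hypothetical toy collection: three vectors near the origin, two near (5, 5).
vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.1, 0.1), (5.0, 5.0)]
lists = build_lists(vectors, centroids)

result, n_scanned = search((0.1, 0.25), vectors, centroids, lists, m_p=1)
_, n_scanned_two_probes = search((0.1, 0.25), vectors, centroids, lists, m_p=2)
```

In this toy case a single probe scans the three vectors of the first cell, while two probes scan the whole collection, which shows how the cost of step (ii) follows the cardinality of the probed cells.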

2.2 Metrics: Selectivity and Recall 

All approximate nearest-neighbor search methods try to find the best trade-off
between result quality and retrieval time. The quality of the results can be
seen as the probability of retrieving the correct neighbors at search time, given
the total amount of data that is processed. This can be expressed in terms of
selectivity and recall, defined as:

• Selectivity is the fraction of vectors used in the distance calculations
(with respect to the whole data collection). Obviously, the larger the
selectivity, the more costly (ii) is.

• Recall is, for a query, the fraction of nearest neighbors correctly
identified (with respect to the above selectivity). This measurement is called
precision in [11], but recall is more accurate here. Observe that if the true
nearest neighbor is found within any of the selected cells, then it will be
ranked first in the result list.
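As a small illustration of the two metrics, the sketch below computes an average selectivity and a recall over a handful of queries. The helper names and the toy numbers are hypothetical; recall is taken here as the fraction of queries whose true nearest neighbor lies in one of the probed cells, which is one reading of the definition above.

```python
def selectivity(n_scanned, n_total):
    """Fraction of the collection whose distance to the query was computed."""
    return n_scanned / n_total


def recall_at_1(true_nn_ids, scanned_id_sets):
    """Fraction of queries whose true nearest neighbor was among the scanned ids."""
    hits = sum(1 for nn, scanned in zip(true_nn_ids, scanned_id_sets)
               if nn in scanned)
    return hits / len(true_nn_ids)


# Hypothetical toy figures: 3 queries over a collection of 10 vectors.
true_nn = [4, 7, 2]                            # ground truth from exhaustive search
scanned = [{3, 4, 5}, {0, 1}, {2, 6, 8, 9}]    # ids scanned for each query

mean_selectivity = sum(selectivity(len(s), 10) for s in scanned) / len(scanned)
recall = recall_at_1(true_nn, scanned)
```

Here the second query misses its true neighbor, so the recall is 2/3 for an average selectivity of 0.3.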

2.3 Imbalance Factor 

As in [1], we measure the imbalance between the cardinalities of the clusters
resulting from a k-means using an imbalance factor $\gamma$ defined as:

$$\gamma = k \sum_{i=1}^{k} p_i^2, \qquad (1)$$

where $p_i$ is the probability that a given vector is stored in the list associated with
the $i$-th cluster. For a fixed dataset of size $N$, this factor is empirically measured
based on the number $n_i \approx p_i N$ of descriptors associated with each list. As
shown in [3], for $m_p = 1$ and for a fixed k, the measure $\gamma$ of the balancing is
directly related to the search cost: a measure $\gamma = 3$ means that the expectation
of the search time is three times higher than the one associated with perfectly
balanced clusters. Optimal balancing is obtained when $n_i = n_{opt} = N/k$ for all
$i$. In that case, $\gamma = 1$ (lowest possible value) and the variance of query time
is zero, as every cell contains exactly the same number of elements. This clearly
appears in the analytical expression of the variance of the number of elements
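The imbalance factor can be estimated directly from cluster assignments. The sketch below assumes the form $\gamma = k \sum_i p_i^2$ with $p_i$ estimated as $n_i / N$, which is consistent with the properties stated above (a value of 1 for perfectly balanced clusters, and a direct link with the expected search cost for a single probe). The function name and the toy assignments are hypothetical.

```python
from collections import Counter


def imbalance_factor(assignments, k):
    """Estimate gamma = k * sum_i p_i^2, with p_i approximated by n_i / N."""
    n_total = len(assignments)
    counts = Counter(assignments)
    return k * sum((counts.get(i, 0) / n_total) ** 2 for i in range(k))


# Hypothetical toy assignments of 100 vectors to k = 4 clusters.
balanced = [0, 1, 2, 3] * 25                        # 25 vectors per cluster
skewed = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10  # one heavily filled cluster

gamma_balanced = imbalance_factor(balanced, k=4)
gamma_skewed = imbalance_factor(skewed, k=4)
```

With one probe, a query falls into cluster i with probability $p_i$ and then scans $n_i$ vectors, so the expected number of scanned vectors is 52 for the skewed case against 25 for the balanced one. The ratio, 2.08, is the imbalance factor of the skewed assignment.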
3 Balancing Clusters

3.1 The Balancing Process

Balancing clusters is an iterative post-processing step performed on the final
output of a k-means type of algorithm. The idea is to artificially enlarge the
distances between the data points and the centroids of the heavily filled
clusters. These penalties applied to distances depend on the population of clusters.
Hence, the contents of cells and thus their population can be recomputed
accordingly. This balancing process eventually converges to equally filled clusters.
The penalties are called penalization terms and are computed as follows:



where $\alpha$ controls the convergence speed. A small value for $\alpha$ ensures
that balancing is done smoothly, at the cost of more iterations to
reach an even cell population. Note that, at each iteration $l$, the populations
$n_i^l$ are updated in order to take these penalization terms into account. More
precisely, distances from any point $x$ to the $i$-th centroid are computed as



3.2 Geometrical Interpretation 

A geometrical interpretation of the balancing process described above is possible.
Assume the balancing process first embeds the k-means clustered d-dimensional
vectors into a (d + 1)-dimensional space. In this space, their first d components
are the ones they had in their original space, while component d + 1 is set
to zero for all vectors. Centroids are also embedded in the same way, except
for their last component. This last value for centroid i is first set to the
square root of its initial penalization term. Then, while the balancing procedure
iterates, it is set to the square root of the current penalization term.
The intuition is that centroids are artificially elevated in an iterative manner
from the hyperplane where vectors lie. The more vectors in one cluster, the
more elevation its centroid gets. This is illustrated in Figure 1, where the
z-axis corresponds to the added dimension. Along iterations, the updated vector
assignments are computed with respect to the coordinates of the points lying
in the augmented space. The artificial elevation of centroids tends to shrink
the most populated clusters, dispatching some of their points to neighboring cells.
Figure 2 exhibits the influence of the (d + 1)-th coordinate of the centroids on
the position of the borders.
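The embedding can be checked numerically on a toy case. The sketch below assumes that the penalization term is added to the squared Euclidean distance, so that lifting a centroid by the square root of its term along the extra axis yields the same value. The helper names and the numbers are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def embed_point(x):
    """Data points receive a zero as their (d + 1)-th component."""
    return tuple(x) + (0.0,)


def embed_centroid(c, penalty):
    """Centroids are elevated by sqrt(penalty) along the added axis."""
    return tuple(c) + (math.sqrt(penalty),)


x = (1.0, 2.0)
c = (4.0, 6.0)
penalty = 9.0   # hypothetical penalization term of this centroid

plain = sq_dist(x, c)                                         # original space
lifted = sq_dist(embed_point(x), embed_centroid(c, penalty))  # augmented space
```

Under this assumption the distance in the augmented space equals the original squared distance plus the penalization term (25 + 9 in this example), which is why a heavily elevated centroid loses the points closest to its borders.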

Figure 1: Data points and centroids embedded in a 3-d example. Data points
are plotted as dots while centroids are represented as crosses, with a non-null
z-axis value after some iterations.

Figure 2: Voronoi cell boundaries shifted after some iterations. New boundaries
are plotted as dashed red lines which shrink the central cluster because of its
large population.

3.3 Partial Balancing

The proposed balancing strategy empirically converges towards clusters having
the same size. Several stopping criteria can be applied, the simplest being
a fixed maximum number of iterations. It is also possible to target a particular
value for $\gamma$, which is recomputed at every step; this target is either fixed or
set in proportion to the original imbalance factor. Stopping the balancing early
reduces the overall distortion of the Voronoi cells created by the original k-means.
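The balancing loop and its two stopping criteria can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the paper: the penalization terms are assumed to start at 1, to be added to the squared distance, and to be multiplied at each iteration by $(n_i / n_{opt})^\alpha$. All names and the toy data are hypothetical, and centroid positions are kept fixed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids, penalties):
    """Index of the nearest centroid under the penalized squared distance."""
    return [
        min(range(len(centroids)),
            key=lambda i: sq_dist(p, centroids[i]) + penalties[i])
        for p in points
    ]


def imbalance_factor(counts):
    """gamma = k * sum_i (n_i / N)^2, equal to 1.0 for perfectly even cells."""
    n_total = sum(counts)
    return len(counts) * sum((n / n_total) ** 2 for n in counts)


def balance(points, centroids, alpha, max_iter, target_gamma=None):
    """Iteratively penalize crowded cells; returns labels and gamma history."""
    k = len(centroids)
    n_opt = len(points) / k
    penalties = [1.0] * k  # ASSUMPTION: equal start, so iteration 0 is plain k-means
    history = []
    for iteration in range(max_iter + 1):
        labels = assign(points, centroids, penalties)
        counts = [labels.count(i) for i in range(k)]
        gamma = imbalance_factor(counts)
        history.append(gamma)
        # Stopping criteria: iteration budget spent, or target gamma reached.
        if iteration == max_iter:
            break
        if target_gamma is not None and gamma <= target_gamma:
            break
        # ASSUMPTION: multiplicative update, growing for cells above n_opt.
        penalties = [b * (n / n_opt) ** alpha
                     for b, n in zip(penalties, counts)]
    return labels, history


# Toy 1-d data: eight points around 0.35 and two around 1.55.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.6,), (0.7,),
          (1.5,), (1.6,)]
centroids = [(0.35,), (1.55,)]

# alpha is deliberately large so that this toy example needs only a few steps.
full_labels, full_history = balance(points, centroids, alpha=0.5,
                                    max_iter=10, target_gamma=1.0)
partial_labels, partial_history = balance(points, centroids, alpha=0.5,
                                          max_iter=1)
```

On this toy data the full run moves one point per iteration from the crowded cell to its neighbor and stops once both cells hold five points, while the run limited to one iteration stops at a 7/3 split, which distorts the original cells less.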



4 Experiments 

4.1 Datasets and Imbalance Factors Analysis 

Our analysis has been performed on descriptors extracted from a large set of
real-world images. We downloaded one million images from Flickr to build
the database and another set of one thousand images for the queries. Several
description schemes were applied to these images, namely SIFT local
descriptors [7], Bag-of-features [H] (BOF), GIST [13] and VLAD descriptors [5].
SIFT descriptors were extracted from Hessian-Affine regions [TU] using the
software of [5J. The BOF vectors have been generated from these local
descriptors, using a codebook obtained by regular k-means clustering with 1000
visual words. The VLAD descriptors were generated using a codebook of 64 visual
words applied to the same SIFT descriptors, leading to vectors of dimension
64 x 128 = 8192. For GIST, we have used the most common setup, i.e., the three
color channels and 3 scales, leading to 960-dimensional descriptors.

Descriptor   Dimension   k=256   k=1024
SIFT               128    1.08     1.09
BOF               1000    1.65     1.93
GIST               960    1.72     3.75
VLAD              8192    5.41     6.23

Table 1: Imbalance factor for k-means clustered state-of-the-art descriptors,
measured on a dataset of one million images for two values of k.

The global descriptors (BOF, GIST and VLAD) produce exactly one
descriptor per image, leading to one million vectors for each type of descriptor. In
order to keep the same number of vectors for the SIFT set, we have randomly
subsampled the local descriptors to produce a million-sized set. In all cases, we
assume a closed-world setup, i.e., the dataset to be indexed is fixed, which is
valid for most applications.

Table 1 reports the imbalance factors obtained for each type of descriptor
after performing a standard k-means clustering on our database. It can be
observed that higher-dimensional vectors tend to produce higher imbalance factors.
BOF descriptors have an imbalance factor which is lower than GIST for a
comparable dimension, which might be due to their higher sparsity. Note that the
value of k has a significant impact on $\gamma$: larger values of k lead to significantly
higher $\gamma$ (k=256 and k=1024). The low values of k we have considered here
probably explain why the values of $\gamma$ measured for the SIFT descriptors in
Table 1 are lower than those of the literature: Jégou et al. [1] report 1.21 and
1.34 for codebooks of size k=20 000 and k=200 000, respectively.

Our balancing strategy is especially interesting for global descriptors for 
which, in contrast to local descriptors, exactly one query vector is used. In 
this case, with perfectly balanced clusters, querying an image is performed in 
constant time. This is the rationale for focusing our analysis on the well known 
BOF vectors. 

4.2 Evaluation of the proposed method 

In this subsection we analyze the impact of our method on selectivity, recall and
variability of the response time. We also analyze the convergence properties of
our method. The parameter $\alpha$ is set to $\alpha = 0.01$ in all our experiments.

Selectivity/recall performance: Figure 3 shows the performance in terms
of this trade-off for different values of k. First note that the trade-off between
selectivity and recall can be adjusted using the number k of clusters and the
number $m_p$ of probes. We keep the ratio $m_p/k$ constant in order to better show
the impact of our method, which exhibits performance comparable to that of
the k-means clustering in terms of selectivity and recall. Figures 4 and 5 show
comparable results when $m_p$ is constant. Note however that with our method a
given selectivity/recall point is obtained with a much better (lower) variability
of the response time, as shown later in this section.

Figure 3: Selectivity/recall performance: impact of partial and full balancing
on this trade-off. For each value of k, the top-right points correspond to the
original k-means partition (no iteration). From top to bottom, the points of a
given curve correspond to 8, 16, 32 and 64 iterations. Similar to choosing a high
value of k, our method reduces the selectivity (i.e., provides better efficiency) at
the cost of lower recall. The different selectivity/recall trade-offs are obtained,
for our method, with a significantly lower variability of the response time.

Impact of the number of iterations: The number r of iterations performed
by Equation 3 is an important parameter of our method, as it controls the
extent to which complete balancing is enforced. Figure 3 shows that selectivity
is reduced in the first iterations with a reasonable decrease of the recall, i.e.,
comparable to what we would obtain by modifying the number of clusters. The
next iterations are comparatively less interesting, as the gain in selectivity is
obtained at the cost of a relatively higher decrease in recall. Modifying the
stopping criterion allows our method to attain a target imbalance factor which
is competitive with respect to the selectivity/recall trade-off.

Convergence speed: Figure 6 illustrates how the imbalance factor evolves
along iterations. Only a few iterations are needed to attain reasonably balanced
clusters. Our update procedure has a computational cost which is negligible
compared with that of the clustering. Higher values of k do not require more
iterations, which is somewhat surprising as more penalization terms have to be
learned. Note that the convergence of our algorithm is not guaranteed, though
it has been observed in all the experiments presented in this paper.

Figure 4: Selectivity/recall performance: impact of partial and full balancing
on this trade-off for $m_p = 1$. For each balancing strategy, the top-right points
correspond to small values of k. From top to bottom, k varies between 256 and
1024. Observe that, while balancing improves the performance in terms of this
trade-off for small values of k, it tends to deteriorate this trade-off for
large k.

Variance of the query response time: The impact of our balancing strategy
on the variability of the response time is illustrated by Figure 7, which gives the
distribution of the number of elements returned by the indexing structure. The
tight distribution obtained by our method shows that the objective of reducing
the variability of the query time resulting from unbalanced clusters is fulfilled:
the response time is almost constant with full balancing. Partial balancing
also significantly improves the shape of the distribution, which has a
significantly reduced variance compared with the original one.
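The link between list sizes and the spread of the query cost can be illustrated on a toy case. The sketch below is hypothetical (made-up list sizes and probed cells, a single probe per query) and simply measures the population variance of the number of elements scanned per query.

```python
from statistics import pvariance


def scanned_per_query(probed_cells, list_sizes):
    """Number of vectors scanned by each query, given the cells it probes."""
    return [sum(list_sizes[i] for i in cells) for cells in probed_cells]


# Hypothetical toy setup: 4 cells, 100 vectors, one probe, four queries.
probed = [[0], [0], [1], [3]]
original_sizes = [70, 10, 10, 10]   # imbalanced lists
balanced_sizes = [25, 25, 25, 25]   # perfectly balanced lists

before = scanned_per_query(probed, original_sizes)
after = scanned_per_query(probed, balanced_sizes)
spread_before = pvariance(before)
spread_after = pvariance(after)
```

With imbalanced lists the queries scan either 70 or 10 elements, whereas with equal lists every query scans 25 elements and the variance drops to zero. In practice balancing also changes which cells a query probes, which this toy example ignores.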

Impact of the choice of descriptors on observed results: In order to
validate our approach on a different kind of descriptor, we tested it using Fisher
kernels with 16 Gaussians. The query set is the concatenation of the Holidays
dataset [3] and the UKB one [12]. The results, as shown in Figure 8, are strongly
dependent on the value of k. This is due to the fact that k-means clustering for
small values of k leads to well-balanced clusters ($\gamma < 1.1$) while k = 1024 reaches
an imbalance factor of 2.2. In the latter case, balancing shows its efficiency in
terms of selectivity, as expected.

Figure 5: Selectivity/recall performance: impact of partial and full balancing
on this trade-off for $m_p = 32$. For each balancing strategy, the top-right points
correspond to small values of k. From top to bottom, k varies between 256 and
1024. In this case, balancing tends to deteriorate this trade-off.



4.3 Is a closed-world setup mandatory?

The previous section presented results obtained in a closed-world setup, as it
makes it possible to achieve quasi-constant query time in all cases. However,
Figure [5] shows that, as soon as the distribution of the learning set is
reasonably close to that of the database, a comparable selectivity-versus-recall
compromise can be achieved in the open-world case. In this example, the database
is the same as the one used in the previous experiments. For both the open-world
and semiclosed-world setups, another 1 million images from Flickr are used as a
learning set to train k-means. The difference between the two setups is that in
the semiclosed-world one, balancing is learnt on the database itself, while in
the open-world setup it is optimized on the learning set, which could lead to
unbalanced database clusters.

Note nevertheless that the quality of the balancing in semiclosed-world and
open-world setups strongly depends on the learning set having a distribution
comparable to that of the database. Therefore, their usage should be restricted
to cases where this assumption is likely to hold, for example when the
learning set is a subset of the entire database.



Figure 6: Convergence speed

Figure 7: Histograms of the number of elements returned, computed over our
1000 queries, for the original k-means and our algorithm with three different
numbers of iterations. Observe the tightness of the distribution in the case of
our method, which reflects a very low variability in response time.

Figure 8: Selectivity/recall performance: impact of partial and full balancing
on this trade-off for Fisher kernel descriptors. From top to bottom, the points
of a given curve correspond to 8, 16, 32 and 64 iterations.

5 Conclusion

Many high-dimensional indexing schemes rely on a partitioning of the feature
space into clusters obtained from a k-means type of approach. These schemes
are efficient because they process a very small number of clusters for answering
each query. Their performance suffers, however, from having to process clusters
with very different cardinalities since this causes great variations in the response
time to queries. This paper presents an algorithm that iteratively balances
clusters such that they become more equal in size. Reducing the variance and the
expectation of response times is a key issue when targeting high-performance
settings, especially when data has to be read from disk. Our experiments
demonstrated that clusters are better balanced without significantly impacting
the search quality. We are planning to index much larger data collections where
the imbalance factor will be higher, as for the promising VLAD descriptors [5],
increasing the need for a more uniform cluster distribution.